Large Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

Xie, Guanwen, Xu, Jingzehua, Yang, Yiyuan, Zhang, Shuai

arXiv.org Artificial Intelligence

Leveraging large language models (LLMs) for designing reward functions demonstrates significant potential. However, achieving effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we enable LLMs to act as effective white-box searchers, highlighting their advanced semantic understanding capabilities. Specifically, we generate reward components for each explicit user requirement and employ a reward critic to identify the correct code form. The LLMs then assign weights to the reward components to balance their values, and iteratively search and optimize these weights based on the context provided by the training log analyzer, while adaptively determining the search step size. We applied the framework to an underwater information collection RL task without direct human feedback or reward examples (zero-shot). The reward critic successfully corrected the reward code with only one round of feedback per requirement, effectively preventing the irreparable errors that can occur when reward function feedback is provided in aggregate. The effective initialization of weights enables different reward functions within the Pareto solution set to be obtained without any weight search. Even when a weight is off by a factor of 100, fewer than four iterations are needed to obtain solutions that meet user requirements. The framework also works well with most prompts using GPT-3.5 Turbo, since it does not require advanced numerical understanding or calculation.
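
As a rough illustration of the weighted reward composition and iterative weight search described in the abstract, the sketch below is not the authors' code; names such as analyze_log and propose_update are hypothetical stand-ins for the training log analyzer and the LLM weight searcher.

```python
# A minimal sketch, assuming hypothetical helpers, of weighted reward
# composition and iterative weight search (not the authors' implementation).

def total_reward(components, weights, state):
    """Combine per-requirement reward components with LLM-assigned weights."""
    return sum(w * comp(state) for comp, w in zip(components, weights))

def search_weights(weights, analyze_log, propose_update, max_iters=4):
    """Refine component weights using feedback from a training-log analyzer.

    analyze_log(weights) is assumed to return a dict summarizing how each
    reward component behaved during training; propose_update(weights, summary)
    stands in for the LLM, returning new weights (with an adaptively chosen
    step size) based on that summary.
    """
    for _ in range(max_iters):
        summary = analyze_log(weights)
        if summary.get("satisfied"):  # all user requirements met
            break
        weights = propose_update(weights, summary)
    return weights
```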


The Distributional Reward Critic Architecture for Perturbed-Reward Reinforcement Learning

Chen, Xi, Zhu, Zhihui, Perrault, Andrew

arXiv.org Artificial Intelligence

We study reinforcement learning in the presence of an unknown reward perturbation. Existing methodologies for this problem make strong assumptions, including reward smoothness, known perturbations, and/or perturbations that do not modify the optimal policy. We study the case of unknown arbitrary perturbations that discretize and shuffle reward space, but have the property that the true reward belongs to the most frequently observed class after perturbation. This class of perturbations generalizes existing classes (and, in the limit, all continuous bounded perturbations) and defeats existing methods. We introduce an adaptive distributional reward critic and show theoretically that it can recover the true rewards under technical conditions. Under the targeted perturbation in discrete and continuous control tasks, we win/tie the highest return in 40/57 settings (compared to 16/57 for the best baseline). Even under the untargeted perturbation, we still hold an edge over the baseline designed specifically for that setting.

The use of reward as an objective is a central feature of reinforcement learning (RL) that has been hypothesized to constitute a path to general intelligence (Silver et al., 2021). The reward is also the cause of a substantial amount of human effort associated with RL, from engineering to reduce difficulties caused by sparse, delayed, or misspecified rewards (Ng et al., 1999; Hadfield-Menell et al., 2017; Qian et al., 2023) to gathering large volumes of human-labeled rewards used for tuning large language models (LLMs) (Ouyang et al., 2022; Bai et al., 2022).
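
As a rough illustration (not the paper's implementation) of how a reward can be recovered when the perturbation only reassigns discretized reward classes while leaving the true reward's class most frequent, the sketch below uses a tabular per-(state, action) histogram in place of the learned distributional reward critic; the class name, bin layout, and method names are hypothetical.

```python
# A minimal sketch of mode-based reward recovery under a discretize-and-shuffle
# perturbation: the most frequently observed reward bin per (state, action) is
# taken as the true reward. A tabular histogram stands in for the learned
# distributional reward critic described in the abstract.
from collections import defaultdict, Counter

class ModeRewardEstimator:
    def __init__(self, num_bins, r_min, r_max):
        self.num_bins = num_bins
        self.r_min, self.r_max = r_min, r_max
        self.counts = defaultdict(Counter)  # (state, action) -> bin counts

    def _bin(self, r):
        width = (self.r_max - self.r_min) / self.num_bins
        idx = int((r - self.r_min) / width)
        return max(0, min(idx, self.num_bins - 1))

    def observe(self, state, action, perturbed_reward):
        """Record one perturbed reward observation for (state, action)."""
        self.counts[(state, action)][self._bin(perturbed_reward)] += 1

    def recover(self, state, action):
        """Return the center of the most frequent bin: the estimated true reward."""
        bin_idx, _ = self.counts[(state, action)].most_common(1)[0]
        width = (self.r_max - self.r_min) / self.num_bins
        return self.r_min + (bin_idx + 0.5) * width
```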